Applying Natural Language Processing: How I Let Data Tell the Story
In this post-newspaper age, we are bombarded by an endless stream of web content. I find this quite encouraging, because staying up to date on current events has never been more accessible.
Thousands of articles and blog posts are added to the web every single day: truly, a daunting reading list for anyone. How do we "choose" what is "important" to read? (Or, more technically: how does one parse text and separate signal from noise?) This is a perfect exercise for Natural Language Processing (NLP), a field of machine learning and artificial intelligence.
NLP is a challenging area of machine learning due to the complexity and nuance of human language. While comprehending and generating text is a routine, everyday task for humans, machines have a much tougher time. So, how is a machine to learn what "important" means?
The first step in approaching any problem is to define the purpose: here, to reduce the amount of material read while preserving an overall understanding of the article. The output is not intended to replace the article, but to add context to a headline, giving the reader enough information to decide whether to read further (clickbaiters, beware).
The model:
At a high level, this model takes a string, scores each sentence for its importance, and returns the top-scoring records. The model was considered successful if it produced a snippet from a body of text that maintained coherence and stayed relevant to the main topic discussed. Unfortunately, this is not an exact quantitative measurement, yet (perhaps the singularity will fix this). The general process is sketched below.
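A minimal sketch of that flow, assuming a simple regex sentence splitter and a pluggable scoring function; the names (split_sentences, summarize, score_sentence) are mine for illustration, not the original code:

```python
# Minimal sketch of the high-level flow: split the text into sentences,
# score each one, and return the top-scoring records in their original order.
import re

def split_sentences(text):
    # Naive splitter on sentence-ending punctuation; a library such as NLTK
    # would be more robust in practice.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, score_sentence, top_n=3):
    sentences = split_sentences(text)
    top = set(sorted(sentences, key=score_sentence, reverse=True)[:top_n])
    # Return the winners in their original reading order so the snippet stays coherent.
    return [s for s in sentences if s in top]
```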
One tool in the data scientist’s NLP arsenal is called Term Frequency-Inverse Document Frequency, or TF-IDF (wiki). This clever, old algorithm, devised in 1972, weighs each word (or token) by how frequently it appears in a document relative to how rare it is across the corpus, and collects those weights into a sparse matrix. Now we're working with numbers, a universal language, interpretable by computers and humans alike.
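To make that sparse matrix concrete, here is a tiny illustration using scikit-learn's TfidfVectorizer; the toy documents are made up for demonstration:

```python
# Illustration: TfidfVectorizer converts raw text into a sparse matrix of
# TF-IDF weights, with one row per document and one column per token.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The market rallied after the surprise announcement.",
    "Analysts expect the market to cool next quarter.",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # SciPy sparse matrix, shape (2, n_tokens)

print(tfidf.shape)
print(dict(zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0].round(2))))
```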
We could stop here, asking the machine to return the sentences that contain the highest number of “important” words. While this approach is quick, it also proved to be dirty: picking the low-hanging fruit was impractical, often delivering outputs riddled with noise and dissimilar sentences (a rough sketch of this baseline follows). The hunt continues...
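The quick-and-dirty baseline might look roughly like this; it is an illustration of the idea, not the code that was actually shipped, and it simply counts how many of the globally top-weighted TF-IDF terms each sentence contains:

```python
# Rough baseline (illustrative): score each sentence by how many of the
# top-weighted TF-IDF terms it contains. Quick, but noisy in practice.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def naive_scores(sentences, n_terms=10):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)
    terms = vectorizer.get_feature_names_out()
    # Treat the highest-weighted terms across all sentences as the "important" words.
    top_terms = set(terms[np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[-n_terms:]])
    # Deliberately naive tokenization: lowercase, split on whitespace, strip punctuation.
    return [sum(w.strip('.,!?;:') in top_terms for w in s.lower().split()) for s in sentences]
```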
Developing a scoring method that discriminates against noise proved to be a fun, challenging mental exercise. After many iterations, I finally landed on two functions that, used in conjunction, gave the most consistent, coherent group of sentences.
The first function is fairly simple: it uses scikit-learn's TF-IDF implementation to establish the importance of key terms, then scores each sentence based on that importance metric. Two records were chosen in this manner.
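A sketch of what such a function could look like, as an approximation of the approach rather than the author's code; the function name and the per-sentence sum of weights are my assumptions:

```python
# Sketch of the first scoring pass: fit TF-IDF over the article's sentences,
# score each sentence by the total weight of its terms, and keep the top two
# records in their original order.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_top_sentences(sentences, n_keep=2):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)        # (n_sentences, n_terms)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()     # total TF-IDF weight per sentence
    top_idx = sorted(np.argsort(scores)[-n_keep:])
    return [sentences[i] for i in top_idx]
```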
The second function is based on a recommendation algorithm, specifically a modified version of the Jaccard index (wiki). Recommendations are made by evaluating a similarity coefficient for each record against the key terms from the previous function's output, rather than against the entire original string. This noticeably bolstered the results in terms of content extraction and noise reduction.
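One way such a pass could be sketched: build a set of key terms from the sentences chosen by the first function, then rank the remaining candidates by a plain Jaccard coefficient against that set. The exact modification to the index isn't spelled out here, so treat this as an illustration only.

```python
# Illustrative second pass: rank candidate sentences by Jaccard similarity
# to the key terms drawn from the first function's output.
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_similar(selected, candidates, n_keep=2):
    # Naive tokenization is used here purely for demonstration.
    key_terms = {w for sentence in selected for w in sentence.lower().split()}
    ranked = sorted(candidates,
                    key=lambda s: jaccard(key_terms, set(s.lower().split())),
                    reverse=True)
    return ranked[:n_keep]
```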
This model can be applied to condensing text articles, saving the time and mental energy otherwise spent parsing countless articles in the everyday effort to stay informed. Personally, I have deployed it to power a Slack chatbot that delivers blurbs from news articles; if a blurb piques my interest, I can follow up by exploring the original.
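For flavor, posting a blurb to Slack with the official slack_sdk client can be as short as the snippet below; the token variable, channel name, and snippet contents are placeholders, not the actual bot.

```python
# Hypothetical delivery step: post the summarizer's output to a Slack channel
# using the official slack_sdk WebClient. Token and channel are placeholders.
import os
from slack_sdk import WebClient

snippet = ["First top-scoring sentence.", "Second top-scoring sentence."]  # summarizer output
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.chat_postMessage(
    channel="#news-blurbs",
    text="From today's article:\n" + "\n".join(snippet),
)
```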
Ryan Cohen is an aspiring data analyst/scientist and enjoys learning about and writing machine learning applications. Link to GitHub here.