Low Level Text Analytics in 7 Minutes

From entity extraction to document summarization, text analytics is a combination of machine learning and natural language processing. Different companies lean more heavily on one or the other (some are all machine learning; some are all "NLP" and rules). These functions are often cast as arcana, understood only by that ever-sexy order of wizards: the data scientists. But it doesn't have to be that way. In this quick, high-level overview we're going to drill a level deeper into text analysis. As always, I'll be using my company, Lexalytics, as an example. We believe it's important to be transparent about the inner workings of text analytics.

In seven minutes you'll have a much better understanding of what text analytics is and how it works! 

If you wanna see these functions at work, with no obligation, check out our free web demo!

Let's say you dump a load of Tweets, online reviews, and forum comments into a text analytics engine. The first thing that needs to happen is breaking this unstructured text down before any analysis can occur. It's much like what we were taught to do as kids in language arts class.

Tearing apart the documents is critical to return an accurate, reliable analysis of items like the entities, themes, and sentiment that we talked about in the previous post.

Language Identification

First, you need to know what language the text is in, since each language has its own peccadilloes. Spanish? Singlish? Arabic? Lexalytics supports over 22 languages (shameless plug) spanning dozens of alphabets, abjads, and logographies. So, basic as it might seem, this subfunction sets the course for the rest of any given analysis. It's very important to get it right.
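To make this concrete, here's a minimal sketch of language identification using the open-source langdetect package. It's an illustrative stand-in, not the detector Lexalytics actually uses:

```python
# Minimal language identification sketch using the open-source langdetect
# package (an illustrative stand-in, not Lexalytics' own detector).
from langdetect import detect

samples = [
    "The ladder was surprisingly sturdy.",
    "La escalera era sorprendentemente resistente.",
]
for text in samples:
    print(detect(text))  # prints an ISO 639-1 code, e.g. 'en' then 'es'
```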

Tokenization

Now that we know what language the text is in, we can break it up into tokens. Tokenization is a necessary step in processing text: it breaks the text apart into pieces that a machine can understand. We say "tokens" rather than "words" because, in addition to words, tokens can also be things like:

  • punctuation (exclamation points affect sentiment, for instance)
  • links (https://...)
  • possessive markers

Now, as I said, tokenization is language specific. For most alphabetic languages, tokenization is straightforward: use white space and punctuation to mark token boundaries. English is easy, because "spaces."

Moving East, logographies (a fancy word for character-based writing systems), such as simplified Chinese, have no space breaks between words, so tokenizing those languages requires machine learning. Each language has its own tokenization requirements.
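For an alphabetic language like English, a whitespace-and-punctuation tokenizer gets you most of the way there. Here's a rough sketch using NLTK's tokenizer as a stand-in for a production engine; note how the possessive marker and the exclamation point come out as their own tokens:

```python
# Rough tokenization sketch for an alphabetic language, using NLTK's
# word_tokenize as a stand-in for a production tokenizer.
# Assumes NLTK's "punkt" tokenizer data is installed: nltk.download("punkt")
from nltk.tokenize import word_tokenize

print(word_tokenize("Bob's pizza was amazing!"))
# ['Bob', "'s", 'pizza', 'was', 'amazing', '!']
```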

Sentence Breaking

Once you've identified the tokens, you can tell where the sentences end. (See, look at that period right there; you knew exactly where the sentence ended, didn't you, Dr. Smart?) Now, do you see what I did there? Did the sentence end at that period at the end of "Dr."? Now check out the punctuation in that sentence: there's a period and a question mark right at the end of it. Will the madness never end? The point is this: you have to tell where the boundaries are on the sentences before you can figure out things like syntax. Certain communication forms (*cough* Twitter *cough*) are less friendly than this post, and we have ways of making them work, but we'll leave that aside for another time.
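Here's a small sentence-breaking sketch, again with NLTK as an open-source stand-in. A well-trained sentence breaker typically knows that the period in "Dr." doesn't end the sentence:

```python
# Sentence-breaking sketch using NLTK's Punkt-based sent_tokenize
# (a stand-in, not the Lexalytics engine).
# Assumes NLTK's "punkt" data is installed: nltk.download("punkt")
from nltk.tokenize import sent_tokenize

text = "You knew where the sentence ended, didn't you, Dr. Smart? Of course you did."
for sentence in sent_tokenize(text):
    print(sentence)
# Expected: two sentences, with the period after "Dr." not treated as a boundary.
```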

PoS Tagging

Now that we've identified the language, tokenized the text, and performed sentence breaking, it's time to PoS tag it. I have to admit that I giggle every time I type "PoS," but that's my inner adolescent speaking. Part-of-speech tagging (or PoS tagging) determines the part of speech for every token in a sentence. In other words, is a given token a proper or common noun? Is it a verb or an adjective? At Lexalytics, we support 93 separate PoS tags.
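As a quick illustration, here's PoS tagging with NLTK's off-the-shelf tagger. It uses the Penn Treebank tag set rather than our 93-tag inventory, but the idea is the same:

```python
# PoS tagging sketch with NLTK's off-the-shelf tagger (Penn Treebank tags,
# not Lexalytics' 93-tag set).
# Assumes the "punkt" and "averaged_perceptron_tagger" data are installed.
import nltk

tokens = nltk.word_tokenize("The tall man is going to quickly walk under the ladder.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('tall', 'JJ'), ('man', 'NN'), ('is', 'VBZ'), ('going', 'VBG'),
#  ('to', 'TO'), ('quickly', 'RB'), ('walk', 'VB'), ('under', 'IN'),
#  ('the', 'DT'), ('ladder', 'NN'), ('.', '.')]
```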

Chunking

Okay, relatively painless so far, right? Let's plow on to the function known as chunking, or light parsing as it's sometimes called. Chunking refers to a range of shallow parsing techniques that splinter a sentence into its component phrases (noun phrases, verb phrases, etc.).

I want to draw a quick distinction between chunking and PoS tagging before we go forward, because there aren't many clear answers to this question on the internet. My definition goes like this:

  • PoS Tagging: Assigning parts of speech to tokens
  • Chunking: Assigning PoS tagged tokens to phrases

Here’s what it looks like when it works in practice. Take the sentence:

The tall man is going to quickly walk under the ladder.

The chunking process will return: [the tall man]_np [is going to quickly walk]_vp [under the ladder]_pp

Where np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for “prepositional phrase.”
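If you want to play with this yourself, NLTK ships a simple regular-expression chunker that works over PoS tags. Here's a toy grammar (my own rough rules, not Lexalytics' chunker) that reproduces the phrases above:

```python
# Toy chunking sketch: a hand-written regex grammar over PoS tags, using
# NLTK's RegexpParser (illustrative only, not Lexalytics' chunker).
# Assumes the "punkt" and "averaged_perceptron_tagger" data are installed.
import nltk

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}     # determiner + adjectives + noun(s)
  PP: {<IN><NP>}              # preposition + noun phrase
  VP: {<VB.*><TO|RB|VB.*>*}   # verb plus trailing verbal material
"""
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize(
    "The tall man is going to quickly walk under the ladder."))
print(chunker.parse(tagged))
# Prints a tree roughly like:
# (S
#   (NP The/DT tall/JJ man/NN)
#   (VP is/VBZ going/VBG to/TO quickly/RB walk/VB)
#   (PP under/IN (NP the/DT ladder/NN))
#   ./.)
```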

Syntax Parsing

We're moving on to one of the most important steps in this process. Now, I don't wanna give you flashbacks to high school, but the truth is syntax parsing is just fancy talk for sentence diagramming. In short, this subfunction determines the structure of each sentence. It's a seriously critical step if we intend to run sentiment analysis on the text. This becomes clear in the following examples:

  • Apple was doing poorly until Steve Jobs…
  • Because Apple was doing poorly, Steve Jobs…
  • Apple was doing poorly because Steve Jobs…

In the first example, Apple is negative while Steve Jobs is positive. In the second, Apple is still negative, but Steve Jobs is now neutral. In the final example, both Apple and Steve Jobs are negative. This is one of the most computationally intensive steps, but we've developed special unsupervised machine learning, trained on billions of words of input and using matrix factorization, to help us, well, cheat. Better put: to help us understand the syntax much like a human would. (Again, the details are beyond the scope of this article.)
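To see what a syntax parse looks like in practice, here's a dependency parse from spaCy's pretrained English model. It's an open-source stand-in for the parsing described above, and the sentence is my own completed variant of the third example, added just for illustration:

```python
# Dependency-parsing sketch with spaCy's pretrained English model
# (an open-source stand-in, not Lexalytics' parser).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was doing poorly because Steve Jobs left.")

for token in doc:
    # Each token gets a grammatical relation (dep_) and a head it attaches to;
    # this structure is what lets sentiment land on the right entity.
    print(f"{token.text:<10} {token.dep_:<10} head={token.head.text}")
```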

 Okay, onward and upward!

Inter-Sentence Relationships

Now that you know everything you can about a single sentence, you need to relate sentences to one another so that you can carry sentiment from one sentence onto another.

One technique that Lexalytics uses to relate sentences together is called "chaining," or "lexical chaining" to be precise. Lexical chaining links individual sentences by each sentence's strength of association with an overall topic. Even if related sentences appear many paragraphs apart in a document, the chain flows through the document and lets the machine detect and quantify the overall "feel."
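Here's a deliberately tiny sketch of the core idea: link sentences that share content words, so topic and sentiment signals can flow between them. A real implementation uses lemmatization, synonyms, and association strength rather than exact word overlap, so treat this as a cartoon of the idea, not the Lexalytics approach:

```python
# Toy lexical-chaining sketch: connect sentences that share content words.
# A real system would use lemmas, synonyms, and association strength;
# this is only a cartoon of the idea.
from collections import defaultdict

sentences = [
    "The new phone's battery is outstanding.",
    "I also love the camera.",
    "After a week, the battery still lasts two days.",
]

STOPWORDS = {"the", "is", "a", "i", "also", "after", "still", "two", "new"}

def content_words(sentence):
    # Crude normalization: lowercase and strip punctuation/possessives.
    words = {w.lower().strip(".,!?'s") for w in sentence.split()}
    return words - STOPWORDS

chains = defaultdict(list)
for index, sentence in enumerate(sentences):
    for word in content_words(sentence):
        chains[word].append(index)

# Words that chain two or more sentences together carry the topic.
print({word: idx for word, idx in chains.items() if len(idx) > 1})
# {'battery': [0, 2]}
```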

Summary

Think about it like this: once you know what language it is, you can tell what words and punctuation are there. Once you know what words and punctuation are there, you can tell where the sentences break. Once you know where the sentences break, you can tell what phrases are there and how they fit together. Last, but not least, you can relate the sentences to one another to form a logical thought.
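And if you'd like to see the whole stack run end to end, here's a compact sketch with spaCy standing in for a commercial engine: one call gives you tokens, sentence boundaries, PoS tags, noun chunks, and dependencies.

```python
# End-to-end sketch of the pipeline described above, with spaCy standing in
# for a commercial text analytics engine.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tall man is going to quickly walk under the ladder. He is not afraid.")

print([t.text for t in doc])                          # tokenization
print([s.text for s in doc.sents])                    # sentence breaking
print([(t.text, t.pos_) for t in doc])                # PoS tagging
print([chunk.text for chunk in doc.noun_chunks])      # (noun) chunking
print([(t.text, t.dep_, t.head.text) for t in doc])   # syntax parsing
```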

