Low Level Text Analytics in 7 Minutes

From entity extraction to document summarization, text analytics is a combination of machine learning and natural language processing. Different companies lean more heavily on one or the other (some are all machine learning; some are all "NLP" and rules). These functions are often cast as arcana, understood only by that ever-sexy order of wizards: the data scientists. But it doesn't have to be that way. In this quick, high-level overview we're going to drill a level deeper into text analysis. As always, I'll be using my company, Lexalytics, as an example. We believe it's important to be transparent about the inner workings of text analytics.

In seven minutes you'll have a much better understanding of what text analytics is and how it works! 

If you wanna see these functions at work, with no obligation, check out our free web demo!

Let's say you dump a load of Tweets, online reviews, and forum comments into a text analytics engine. The first thing that needs to happen is breaking this unstructured text down before any analysis can occur. It's much like what we were taught to do as kids in language arts class.

Tearing apart the documents is critical to return an accurate, reliable analysis of items like the entities, themes, and sentiment that we talked about in the previous post.

Language Identification

First, you need to know what language the text is in, since each language has its own peccadilloes. Spanish? Singlish? Arabic? Lexalytics supports over 22 languages (shameless plug) spanning dozens of alphabets, abjads, and logographies. So, basic as it might seem, this subfunction sets the course for the rest of any given analysis. It's very important to get it right.
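To make this concrete, here's a minimal sketch of language identification using the open-source langdetect package. It's an illustrative stand-in, not the detector Lexalytics actually uses:

```python
# Minimal language identification sketch using the open-source langdetect
# package (an illustrative stand-in, not Lexalytics' own detector).
from langdetect import detect

samples = [
    "The ladder was surprisingly sturdy.",
    "La escalera era sorprendentemente resistente.",
]
for text in samples:
    print(detect(text))  # prints an ISO 639-1 code, e.g. 'en' then 'es'
```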

Tokenization

Now that we know what language the text is in, we can break it up into tokens. Tokenization is a necessary step in processing text: it breaks the text apart into pieces that a machine can understand. We say "tokens" rather than "words" because, in addition to words, tokens can also be things like:

  • punctuation (exclamation points affect sentiment, for instance)
  • links (https://...)
  • possessive markers

Now, as I said, tokenization is language specific. For most alphabetic languages, tokenization is straightforward: use white space and punctuation to mark token boundaries. English is easy, because "spaces."

Moving East, logographies (a fancy word for character-based writing systems), such as simplified Chinese, have no space breaks between words, so tokenizing those languages requires machine learning. Each language has its own tokenization requirements.
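For an alphabetic language like English, a whitespace-and-punctuation tokenizer gets you most of the way there. Here's a rough sketch using NLTK's tokenizer as a stand-in for a production engine; note how the possessive marker and the exclamation point come out as their own tokens:

```python
# Rough tokenization sketch for an alphabetic language, using NLTK's
# word_tokenize as a stand-in for a production tokenizer.
# Assumes NLTK's "punkt" tokenizer data is installed: nltk.download("punkt")
from nltk.tokenize import word_tokenize

print(word_tokenize("Bob's pizza was amazing!"))
# ['Bob', "'s", 'pizza', 'was', 'amazing', '!']
```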

Sentence Breaking

Once you've identified the tokens, you can tell where the sentences end. (See, look at that period right there; you knew exactly where the sentence ended, didn't you, Dr. Smart?) Now, do you see what I did there? Did the sentence end at that period at the end of "Dr."? Now check out the punctuation in that sentence: there's a period and a question mark right at the end of it. Will the madness never end? The point is this: you have to tell where the boundaries are on the sentences before you can figure out things like syntax. Certain communication forms (*cough* Twitter *cough*) are less friendly than this post, and we have ways of making them work, but we'll leave that aside for another time.
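Here's a small sentence-breaking sketch, again with NLTK as an open-source stand-in. A well-trained sentence breaker typically knows that the period in "Dr." doesn't end the sentence:

```python
# Sentence-breaking sketch using NLTK's Punkt-based sent_tokenize
# (a stand-in, not the Lexalytics engine).
# Assumes NLTK's "punkt" data is installed: nltk.download("punkt")
from nltk.tokenize import sent_tokenize

text = "You knew where the sentence ended, didn't you, Dr. Smart? Of course you did."
for sentence in sent_tokenize(text):
    print(sentence)
# Expected: two sentences, with the period after "Dr." not treated as a boundary.
```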

PoS Tagging

Now that we've identified the language, tokenized the text, and performed sentence breaking, it's time to PoS tag it. I have to admit that I giggle every time I type "PoS," but that's my inner adolescent speaking. Part-of-speech tagging (or PoS tagging) determines the part of speech for every token in a sentence. In other words, is a given token a proper or common noun? Is it a verb or an adjective? At Lexalytics, we support 93 separate PoS tags.
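As a quick illustration, here's PoS tagging with NLTK's off-the-shelf tagger. It uses the Penn Treebank tag set rather than our 93-tag inventory, but the idea is the same:

```python
# PoS tagging sketch with NLTK's off-the-shelf tagger (Penn Treebank tags,
# not Lexalytics' 93-tag set).
# Assumes the "punkt" and "averaged_perceptron_tagger" data are installed.
import nltk

tokens = nltk.word_tokenize("The tall man is going to quickly walk under the ladder.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('tall', 'JJ'), ('man', 'NN'), ('is', 'VBZ'), ('going', 'VBG'),
#  ('to', 'TO'), ('quickly', 'RB'), ('walk', 'VB'), ('under', 'IN'),
#  ('the', 'DT'), ('ladder', 'NN'), ('.', '.')]
```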

Chunking

Okay, relatively painless so far, right? Let's plow on to the function known as chunking, or light parsing as it's sometimes called. Chunking refers to a range of shallow parsing techniques that splinter a sentence into its component phrases (noun phrases, verb phrases, etc.).

I want to draw a quick distinction between chunking and PoS tagging before we go forward, because there aren't many clear answers to this question on the internet. My definition goes like this:

  • PoS Tagging: Assigning parts of speech to tokens
  • Chunking: Assigning PoS tagged tokens to phrases

Here’s what it looks like when it works in practice. Take the sentence:

The tall man is going to quickly walk under the ladder.

The chunking process will return: [the tall man]_np [is going to quickly walk]_vp [under the ladder]_pp

Where np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for “prepositional phrase.”
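If you want to play with this yourself, NLTK ships a simple regular-expression chunker that works over PoS tags. Here's a toy grammar (my own rough rules, not Lexalytics' chunker) that reproduces the phrases above:

```python
# Toy chunking sketch: a hand-written regex grammar over PoS tags, using
# NLTK's RegexpParser (illustrative only, not Lexalytics' chunker).
# Assumes the "punkt" and "averaged_perceptron_tagger" data are installed.
import nltk

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}     # determiner + adjectives + noun(s)
  PP: {<IN><NP>}              # preposition + noun phrase
  VP: {<VB.*><TO|RB|VB.*>*}   # verb plus trailing verbal material
"""
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize(
    "The tall man is going to quickly walk under the ladder."))
print(chunker.parse(tagged))
# Prints a tree roughly like:
# (S
#   (NP The/DT tall/JJ man/NN)
#   (VP is/VBZ going/VBG to/TO quickly/RB walk/VB)
#   (PP under/IN (NP the/DT ladder/NN))
#   ./.)
```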

Syntax Parsing

We're moving on to one of the most important steps in this process. Now, I don't wanna give you flashbacks to high school, but the truth is syntax parsing is just fancy talk for sentence diagramming. In short, this subfunction determines the structure of each sentence. It's a seriously critical step if we intend to run sentiment analysis on the text. This becomes clear in the following examples:

  • Apple was doing poorly until Steve Jobs…
  • Because Apple was doing poorly, Steve Jobs…
  • Apple was doing poorly because Steve Jobs…

In the first example, Apple is negative while Steve Jobs is positive. In the second, Apple is still negative, but Steve Jobs is now neutral. In the final example, both Apple and Steve Jobs are negative. This is one of the most computationally intensive steps, but we've developed special unsupervised machine learning, trained on billions of words of input and using matrix factorization, to help us, well, cheat. Better put: to help us understand the syntax much like a human would. (Again, the details are beyond the scope of this article.)
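To see what a syntax parse looks like in practice, here's a dependency parse from spaCy's pretrained English model. It's an open-source stand-in for the parsing described above, and the sentence is my own completed variant of the third example, added just for illustration:

```python
# Dependency-parsing sketch with spaCy's pretrained English model
# (an open-source stand-in, not Lexalytics' parser).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was doing poorly because Steve Jobs left.")

for token in doc:
    # Each token gets a grammatical relation (dep_) and a head it attaches to;
    # this structure is what lets sentiment land on the right entity.
    print(f"{token.text:<10} {token.dep_:<10} head={token.head.text}")
```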

 Okay, onward and upward!

Inter-Sentence Relationships

Now that you know everything you can about a single sentence, you need to relate sentences to one another so that you can carry sentiment from one sentence onto another.

One technique that Lexalytics uses to relate sentences together is called "chaining," or "lexical chaining" to be precise. Lexical chaining links individual sentences by each sentence's strength of association with an overall topic. Even if related sentences appear many paragraphs apart in a document, the chain flows through the document and lets the machine detect and quantify the overall "feel."
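Here's a deliberately tiny sketch of the core idea: link sentences that share content words, so topic and sentiment signals can flow between them. A real implementation uses lemmatization, synonyms, and association strength rather than exact word overlap, so treat this as a cartoon of the idea, not the Lexalytics approach:

```python
# Toy lexical-chaining sketch: connect sentences that share content words.
# A real system would use lemmas, synonyms, and association strength;
# this is only a cartoon of the idea.
from collections import defaultdict

sentences = [
    "The new phone's battery is outstanding.",
    "I also love the camera.",
    "After a week, the battery still lasts two days.",
]

STOPWORDS = {"the", "is", "a", "i", "also", "after", "still", "two", "new"}

def content_words(sentence):
    # Crude normalization: lowercase and strip punctuation/possessives.
    words = {w.lower().strip(".,!?'s") for w in sentence.split()}
    return words - STOPWORDS

chains = defaultdict(list)
for index, sentence in enumerate(sentences):
    for word in content_words(sentence):
        chains[word].append(index)

# Words that chain two or more sentences together carry the topic.
print({word: idx for word, idx in chains.items() if len(idx) > 1})
# {'battery': [0, 2]}
```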

Summary

Think about it like this: once you know what language it is, you can tell what words and punctuation are there. Once you know what words and punctuation are there, you can tell where the sentences break. Once you know where the sentences break, you can tell what phrases are there and how they fit together. Last, but not least, you can relate the sentences to one another to form a logical thought.
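And if you'd like to see the whole stack run end to end, here's a compact sketch with spaCy standing in for a commercial engine: one call gives you tokens, sentence boundaries, PoS tags, noun chunks, and dependencies.

```python
# End-to-end sketch of the pipeline described above, with spaCy standing in
# for a commercial text analytics engine.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tall man is going to quickly walk under the ladder. He is not afraid.")

print([t.text for t in doc])                          # tokenization
print([s.text for s in doc.sents])                    # sentence breaking
print([(t.text, t.pos_) for t in doc])                # PoS tagging
print([chunk.text for chunk in doc.noun_chunks])      # (noun) chunking
print([(t.text, t.dep_, t.head.text) for t in doc])   # syntax parsing
```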

