Machine Learning: Breaking Down a Buzz Phrase

What Is Machine Learning?

First and foremost, what is machine learning, and why is it a good thing? Machine learning is a set of statistical and mathematical tools and algorithms for training a computer to perform a specific task, for example, recognizing faces.

It's becoming a bit of a buzz phrase, so let's break it down a bit. There are two important words here―"training" and "statistical." Training, because you are literally teaching the computer about a particular task. Statistical, because the computer is working with probabilistic math, and the chances of it getting the answer "correct" vary with the type and complexity of the question it's being trained to answer.

There are a number of different types of machine learning algorithms, from the simple "Naïve Bayes" to "Neural Networks" to "Maximum Entropy" and "Decision Trees." I'm more than happy to geek out with you about the advantages and disadvantages of the different types (hit me up in the comments or on Twitter). We can talk about linear vs. non-linear learning, feed-forward systems, or argue about multi-layer hidden networks vs. explicitly exposing each layer.

The Bird's-Eye View

But for now, let's keep this high-level. If you want to dig in a bit deeper, try reading these two white papers my company put out: Lexalytics loves machine learning and Build vs. buy. They'll take you one level down from here.

I work at Lexalytics, a machine learning company. We maintain dozens of both supervised and unsupervised machine learning models. (Close to 40, actually.) We have dozens of person-years dedicated to gathering data sets, experimenting with state-of-the-art machine learning algorithms, and producing models that balance accuracy, broad applicability, and speed. This combined effort has lent me an interesting perspective on the topic of machine learning, and I'm gonna leapfrog off of it for a few examples of practical machine learning applications.

See, Lexalytics is not a general-purpose machine learning company. We are not providing you with generic algorithms that can be tuned for any machine-learning problem. We are entirely, completely, and totally focused on text. All of our machine learning algorithms, models, and techniques are optimized to help you understand the meaning of text content. 

Text content requires special approaches from a machine learning perspective, in that it can have hundreds of thousands of potential dimensions (words, phrases, etc.), but tends to be very sparse in nature: say you've got 100,000 words in common use in the English language; in any given tweet you're only going to get, say, 10-12 of them. This differs from something like video content, where you have very high dimensionality but also oodles and boodles of data to work with, so it's not quite as sparse.
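
To make that concrete, here's a quick sketch in Python using scikit-learn's CountVectorizer. It's an open-source library I'm using purely for illustration – not what we run in production – but it shows the "one dimension per word, almost all of them zero" shape of text data:

```python
# Toy illustration of why text is high-dimensional but sparse.
# Requires scikit-learn: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "machine learning is a set of statistical tools",
    "training a computer to recognize faces",
    "text content tends to be very sparse",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # a sparse matrix of word counts

# Even this tiny corpus gets one dimension per distinct word;
# a real vocabulary would have tens of thousands of dimensions.
print(X.shape)    # (3, number_of_distinct_words)
print(X[0].nnz)   # only a handful of nonzero entries per "tweet"
```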

Concepts

Why is this an issue in machine learning? Because you can't start grouping things together and seeing trends unless you can measure the similarities between pieces of content.

In order to deal with the specific complications of text, we use what's called a "hybrid" approach. Meaning that, unlike pure-play machine learning companies, we use a combination of machine learning, lists, pattern files, dictionaries, and natural language algorithms. In other words, rather than just having a variety of hammers (different machine learning algorithms), we have a nice tool belt full of different sorts of tools, each one optimal for the task at hand.

Deep Learning

The “term du jour” seems to be “deep learning” – which is an excellent rebranding of “neural networks.” Basically, the way that deep learning works is that there are several layers that build up on top of each other in order to recognize a whole. For example, if dealing with a picture, layer 1 would see a bunch of dots, layer 2 would recognize a line, layer 3 would recognize corners connecting the lines, and the top layer would recognize that this is a square.
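
If you want to see the "layers stacked on layers" idea in code, here's a minimal sketch of a forward pass through a tiny feed-forward network in plain numpy. The layer sizes and the ReLU activation are arbitrary choices for illustration, not anybody's production architecture:

```python
# Minimal sketch of "layers build on each other": each layer
# transforms the previous layer's output, so later layers can
# represent increasingly abstract features (dots -> lines ->
# corners -> "square" in the picture analogy).
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

# Three weight matrices = three layers (sizes are arbitrary here).
W1 = rng.normal(size=(64, 32))  # layer 1: raw "pixels" -> low-level features
W2 = rng.normal(size=(32, 16))  # layer 2: low-level -> mid-level features
W3 = rng.normal(size=(16, 4))   # top layer: mid-level -> "whole shape" scores

x = rng.normal(size=(1, 64))    # a fake 64-"pixel" input

h1 = relu(x @ W1)   # hidden layer 1 (opaque in a deep net)
h2 = relu(h1 @ W2)  # hidden layer 2 (also opaque)
out = h2 @ W3       # one score per candidate shape

print(out.shape)    # (1, 4)
```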

This explanation is an abstraction of what happens inside of deep learning for text – the internal layers are opaque math. We have taken a different approach that we believe to be superior to neural networks/deep learning – explicitly layered extraction. We have a multi-layered process for preparing the text that helps reduce the sparseness and dimensionality of the content – but as opposed to the hidden layers in a deep learning model, our layers are explicit and transparent. You can get access to every one of them and understand exactly what is happening at each step.

The Anatomy of Machine Learning

To give you an idea of the anatomy here: just to process a document in English, we use the following machine learning models:

  • Part of Speech tagging
  • Chunking
  • Sentence Polarity
  • Concept Matrix (Semantic Model)
  • Syntax Matrix (Syntax Parsing)

All of those models help us deal with the dimensionality/sparseness problem described above. Now, we have to actually extract stuff, so we've got additional models for the following (a rough sketch of the whole pipeline follows this list):

  • Named Entity Extraction
  • Anaphora Resolution (Associating pronouns with the right words)
  • Document Sentiment
  • Intention Extraction
  • Categorization
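
Our models themselves aren't open source, but to give you a feel for what a pipeline like this looks like in code, here's a rough stand-in using spaCy, an open-source NLP library. spaCy's models are doing the work here, not ours – treat this purely as an illustration of the steps:

```python
# A rough open-source stand-in for this kind of text pipeline.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Lexalytics builds text analytics software in Boston.")

# Part-of-speech tagging
print([(tok.text, tok.pos_) for tok in doc])

# Chunking (noun phrases)
print([chunk.text for chunk in doc.noun_chunks])

# Syntax parsing (dependency labels)
print([(tok.text, tok.dep_, tok.head.text) for tok in doc])

# Named entity extraction
print([(ent.text, ent.label_) for ent in doc.ents])
```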

For other languages, like Mandarin Chinese, we have to actually figure out what a word is, so we need to "tokenize" – which is another machine learning task. English is really easy – see all those spaces? That makes it really simple to tokenize – in other words, to determine what's a word. So we just use a simple set of rules for English tokenization. On the other hand: 中国有没有空格,所以机器学习是符号化的重要。(Read: Chinese has no whitespace, so machine learning is important for tokenization.)
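
Here's a toy comparison. The English side is a simple rule (a regex); for the Chinese side I'm using jieba, an open-source statistical Chinese tokenizer, purely as a stand-in for a trained tokenization model:

```python
# English tokenization can get by with simple rules; Chinese cannot.
# jieba (pip install jieba) is an open-source Chinese tokenizer,
# used here only as an illustration.
import re
import jieba

def english_tokenize(text):
    """Rule-based: grab runs of word characters, split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

print(english_tokenize("English is really easy - see all those spaces?"))

# Chinese has no spaces between words, so the tokenizer itself
# has to be a trained statistical model.
print(list(jieba.cut("中文没有空格")))
```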

Some of our customers, particularly in the market analytics and customer experience management spaces, have been hand-coding categories of content for years. This means they have a lot of content that is bucketed into different categories – which means they have a really great training set for a machine-learning-based classifier. (We can do that for you too!)
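
Here's roughly what that looks like in code, sketched with scikit-learn and a Naive Bayes classifier. The documents and categories are made up, and this stands in for a production setup, not our actual stack:

```python
# Sketch: turning years of hand-coded categories into a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Pretend these came from years of human bucketing.
documents = [
    "the checkout process kept timing out",
    "love the new summer flavors",
    "shipping took three weeks",
    "the app redesign looks great",
]
categories = ["support", "product", "support", "product"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(documents, categories)

print(model.predict(["my order never arrived"]))  # -> ['support'], ideally
```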

Making Machine Learning Efficient

But, and this is a really big but, it is inefficient to do all tasks with the same tool. That's why we also have dictionaries and pattern files, and all sorts of other good stuff like that. To sum up why we use a hybrid approach, let's take the following example: say you've used 50,000 documents to train up a sentiment classifier that does a pretty good job of agreeing with a human as to whether something is positive, negative, or neutral. Awesome!

What happens when a review comes in that it scores incorrectly? There are two approaches: sometimes you have a feedback loop, and sometimes you have to collect a whole corpus of content and retrain the model.

Even in the case of the feedback loop, the behavior of the model isn't going to change immediately, and the change can be unpredictable – because you're just going to tell it "this document was scored incorrectly, it should be positive," and the model is going to take into account all of the words that are already baked into the model itself.

In other words, it’s like you’ve got a big ocean liner. You can start to turn it, but it’s going to take a while and a lot of feedback before it turns. In our approach, you simply look to see what phrases were marked positive and negative, change them as appropriate, and then you’re done. The behavior changes instantly.
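
The details of our phrase handling are proprietary, but the general shape of a dictionary-based fix is easy to sketch. Every name and score below is hypothetical:

```python
# Sketch of the dictionary-style fix described above: instead of
# retraining a model, you edit a phrase's score and the behavior
# changes instantly. All phrases and scores here are hypothetical.
PHRASE_SENTIMENT = {
    "blew me away": 0.9,
    "fell apart": -0.8,
    "sick": -0.5,  # suppose the model keeps scoring slang "sick" wrong
}

def score(text):
    """Crude phrase-dictionary sentiment: sum the scores of matched phrases."""
    text = text.lower()
    return sum(v for phrase, v in PHRASE_SENTIMENT.items() if phrase in text)

print(score("this board is sick"))  # -0.5 today...

# A reviewer flags it: in this domain, "sick" is praise.
PHRASE_SENTIMENT["sick"] = 0.7      # one edit, no retraining
print(score("this board is sick"))  # ...0.7 immediately
```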

We like to think of it as the best of both worlds, and we think you will too.
