Machine Learning: Breaking Down a Buzz Phrase

What Is Machine Learning?

First and foremost, what is machine learning, and why is it a good thing? Machine learning is a set of statistical and mathematical tools and algorithms for training a computer to perform a specific task, for example, recognizing faces.

It's becoming a bit of a buzz phrase, so let's break it down a bit. There are two important words here―"training" and "statistical." Training, because you are literally teaching the computer about a particular task. Statistical, because the computer is working with probabilistic math, and the chances of it getting the answer "correct" vary with the type and complexity of the question it's being trained to answer.

There are a number of different types of machine learning algorithms, from the simple "Naïve Bayes" to "Neural Networks" to "Maximum Entropy" and "Decision Trees." I'm more than happy to geek out with you about the advantages and disadvantages of the different types (hit me up in the comments or on Twitter). We can talk about linear vs. non-linear learning, feed-forward systems, or argue about multi-layer hidden networks vs. explicitly exposing each layer.

The Bird's-Eye View

But for now, let's keep this high-level. If you want to dig in a bit deeper, try reading these two white papers my company put out: Lexalytics loves machine learning and Build vs. buy. They'll take you one level down from here.

I work at Lexalytics, a machine learning company. We maintain dozens of both supervised and unsupervised machine learning models. (Close to 40, actually.) We have dozens of person-years dedicated to gathering data sets, experimenting with state-of-the-art machine learning algorithms, and producing models that balance accuracy, broad applicability, and speed. This combined effort has lent me an interesting perspective on the topic of machine learning, and I'm gonna leapfrog off of it for a few examples of practical machine learning applications.

See, Lexalytics is not a general-purpose machine learning company. We are not providing you with generic algorithms that can be tuned for any machine-learning problem. We are entirely, completely, and totally focused on text. All of our machine learning algorithms, models, and techniques are optimized to help you understand the meaning of text content. 

Text content requires special approaches from a machine learning perspective, in that it can have hundreds of thousands of potential dimensions (words, phrases, etc.), but tends to be very sparse in nature: say you've got 100,000 words in common use in the English language; in any given tweet you're only going to get, say, 10-12 of them. This differs from something like video content, where you have very high dimensionality but also oodles and boodles of data to work with, so it's not quite as sparse.
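
To make that concrete, here's a quick sketch in Python using scikit-learn's CountVectorizer. It's an open-source library I'm using purely for illustration – not what we run in production – but it shows the "one dimension per word, almost all of them zero" shape of text data:

```python
# Toy illustration of why text is high-dimensional but sparse.
# Requires scikit-learn: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "machine learning is a set of statistical tools",
    "training a computer to recognize faces",
    "text content tends to be very sparse",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # a sparse matrix of word counts

# Even this tiny corpus gets one dimension per distinct word;
# a real vocabulary would have tens of thousands of dimensions.
print(X.shape)    # (3, number_of_distinct_words)
print(X[0].nnz)   # only a handful of nonzero entries per "tweet"
```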

Concepts

Why is this an issue in machine learning? Because you can't start grouping things together and seeing trends unless you can measure the similarities between pieces of content.

In order to deal with the specific complications of text, we use what's called a "hybrid" approach. Meaning that, unlike pure-play machine learning companies, we use a combination of machine learning, lists, pattern files, dictionaries, and natural language algorithms. In other words, rather than just having a variety of hammers (different machine learning algorithms), we have a nice tool belt full of different sorts of tools, each one optimal for the task at hand.

Deep Learning

The “term du jour” seems to be “deep learning” – which is an excellent rebranding of “neural networks.” Basically, the way that deep learning works is that there are several layers that build up on top of each other in order to recognize a whole. For example, if dealing with a picture, layer 1 would see a bunch of dots, layer 2 would recognize a line, layer 3 would recognize corners connecting the lines, and the top layer would recognize that this is a square.
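
If you want to see the "layers stacked on layers" idea in code, here's a minimal sketch of a forward pass through a tiny feed-forward network in plain numpy. The layer sizes and the ReLU activation are arbitrary choices for illustration, not anybody's production architecture:

```python
# Minimal sketch of "layers build on each other": each layer
# transforms the previous layer's output, so later layers can
# represent increasingly abstract features (dots -> lines ->
# corners -> "square" in the picture analogy).
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

# Three weight matrices = three layers (sizes are arbitrary here).
W1 = rng.normal(size=(64, 32))  # layer 1: raw "pixels" -> low-level features
W2 = rng.normal(size=(32, 16))  # layer 2: low-level -> mid-level features
W3 = rng.normal(size=(16, 4))   # top layer: mid-level -> "whole shape" scores

x = rng.normal(size=(1, 64))    # a fake 64-"pixel" input

h1 = relu(x @ W1)   # hidden layer 1 (opaque in a deep net)
h2 = relu(h1 @ W2)  # hidden layer 2 (also opaque)
out = h2 @ W3       # one score per candidate shape

print(out.shape)    # (1, 4)
```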

This explanation is an abstraction of what happens inside of deep learning for text – the internal layers are opaque math. We have taken a different approach that we believe to be superior to neural networks/deep learning – explicitly layered extraction. We have a multi-layered process for preparing the text that helps reduce the sparseness and dimensionality of the content – but as opposed to the hidden layers in a deep learning model, our layers are explicit and transparent. You can get access to every one of them and understand exactly what is happening at each step.

The Anatomy of Machine Learning

To give you an idea of the anatomy here: just to process a document in English, we use the following machine learning models:

  • Part of Speech tagging
  • Chunking
  • Sentence Polarity
  • Concept Matrix (Semantic Model)
  • Syntax Matrix (Syntax Parsing)

All of those models help us deal with the dimensionality/sparseness problem described above. Now, we have to actually extract stuff, so we've got additional models for the following (a rough sketch of the whole pipeline follows this list):

  • Named Entity Extraction
  • Anaphora Resolution (Associating pronouns with the right words)
  • Document Sentiment
  • Intention Extraction
  • Categorization
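
Our models themselves aren't open source, but to give you a feel for what a pipeline like this looks like in code, here's a rough stand-in using spaCy, an open-source NLP library. spaCy's models are doing the work here, not ours – treat this purely as an illustration of the steps:

```python
# A rough open-source stand-in for this kind of text pipeline.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Lexalytics builds text analytics software in Boston.")

# Part-of-speech tagging
print([(tok.text, tok.pos_) for tok in doc])

# Chunking (noun phrases)
print([chunk.text for chunk in doc.noun_chunks])

# Syntax parsing (dependency labels)
print([(tok.text, tok.dep_, tok.head.text) for tok in doc])

# Named entity extraction
print([(ent.text, ent.label_) for ent in doc.ents])
```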

For other languages, like Mandarin Chinese, we have to actually figure out what a word is, so we need to "tokenize" – which is another machine learning task. English is really easy – see all those spaces? That makes it really simple to tokenize – in other words, to determine what's a word. So we just use a simple set of rules for English tokenization. On the other hand: 中国有没有空格,所以机器学习是符号化的重要。(Read: Chinese has no whitespace, so machine learning is important for tokenization.)
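
Here's a toy comparison. The English side is a simple rule (a regex); for the Chinese side I'm using jieba, an open-source statistical Chinese tokenizer, purely as a stand-in for a trained tokenization model:

```python
# English tokenization can get by with simple rules; Chinese cannot.
# jieba (pip install jieba) is an open-source Chinese tokenizer,
# used here only as an illustration.
import re
import jieba

def english_tokenize(text):
    """Rule-based: grab runs of word characters, split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

print(english_tokenize("English is really easy - see all those spaces?"))

# Chinese has no spaces between words, so the tokenizer itself
# has to be a trained statistical model.
print(list(jieba.cut("中文没有空格")))
```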

Some of our customers, particularly in the market analytics and customer experience management spaces, have been hand-coding categories of content for years. This means they have a lot of content that is bucketed into different categories – which means they have a really great training set for a machine-learning-based classifier. (We can do that for you too!)
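
Here's roughly what that looks like in code, sketched with scikit-learn and a Naive Bayes classifier. The documents and categories are made up, and this stands in for a production setup, not our actual stack:

```python
# Sketch: turning years of hand-coded categories into a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Pretend these came from years of human bucketing.
documents = [
    "the checkout process kept timing out",
    "love the new summer flavors",
    "shipping took three weeks",
    "the app redesign looks great",
]
categories = ["support", "product", "support", "product"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(documents, categories)

print(model.predict(["my order never arrived"]))  # -> ['support'], ideally
```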

Making Machine Learning Efficient

But, and this is a really big but, it is inefficient to do all tasks with the same tool. That's why we also have dictionaries and pattern files, and all sorts of other good stuff like that. To sum up why we use a hybrid approach, let's take the following example: say you've used 50,000 documents to train up a sentiment classifier that does a pretty good job of agreeing with a human as to whether something is positive, negative, or neutral. Awesome!

What happens when a review comes in that it scores incorrectly? There are two approaches: sometimes you have a feedback loop, and sometimes you have to collect a whole corpus of content and retrain the model.

Even in the case of the feedback loop, the behavior of the model isn't going to change immediately, and the change can be unpredictable – because you're just going to tell it "this document was scored incorrectly, it should be positive," and the model is going to take into account all of the words that are already baked into the model itself.

In other words, it’s like you’ve got a big ocean liner. You can start to turn it, but it’s going to take a while and a lot of feedback before it turns. In our approach, you simply look to see what phrases were marked positive and negative, change them as appropriate, and then you’re done. The behavior changes instantly.
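
The details of our phrase handling are proprietary, but the general shape of a dictionary-based fix is easy to sketch. Every name and score below is hypothetical:

```python
# Sketch of the dictionary-style fix described above: instead of
# retraining a model, you edit a phrase's score and the behavior
# changes instantly. All phrases and scores here are hypothetical.
PHRASE_SENTIMENT = {
    "blew me away": 0.9,
    "fell apart": -0.8,
    "sick": -0.5,  # suppose the model keeps scoring slang "sick" wrong
}

def score(text):
    """Crude phrase-dictionary sentiment: sum the scores of matched phrases."""
    text = text.lower()
    return sum(v for phrase, v in PHRASE_SENTIMENT.items() if phrase in text)

print(score("this board is sick"))  # -0.5 today...

# A reviewer flags it: in this domain, "sick" is praise.
PHRASE_SENTIMENT["sick"] = 0.7      # one edit, no retraining
print(score("this board is sick"))  # ...0.7 immediately
```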

We like to think of it as the best of both worlds, and we think you will too.
