Text mining using vectors explained to business people
Federico Cesconi
Founder & CEO @sandsiv, the number one CXM solution powered by AI | Author | In love with NLP using transformers
In recent years, many researchers have invested time and resources into a new approach to Natural Language Processing: forget linguistic algorithms, let's turn words into mathematical vectors and plot them in a 3D space. Applying linear algebra then allows us to perform specific tasks, such as topic detection, sentiment analysis, etc., without writing a single line of purely linguistic algorithms. Too nice to be true? Let's discover the magic world of vectors using a very simple example.
One of the pioneers in this area of linguistic computation is Tomas Mikolov. Tomas was one of many extremely smart researchers at Google (today he is still a smart researcher, but at Facebook AI, working on fastText). His original idea was quite nice and simple: create a model for learning high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. You can read the result of his research here.
How is it possible that reducing words to vectors can solve typical business problems such as topic detection, sentiment analysis, spotting new topics as they pop up, etc.?
Understanding word vectors
Before I start my explanation, please be aware that the aim is to help business people better understand how word embedding works. I will not open new academic discussions on word embedding; this is a very high-level, simple explanation. I will use an example created by Allison Parrish. Allison is a poet but also a programmer who uses Python in a very artistic way. You can find very interesting articles on her blog.
Let's start by considering a very small group of words: 14 animals. What we are going to do involves essentially two steps:
- Turn words (names of animals) into vectors
- Plot those vectors in a space - in our case, a two-dimensional space.
Turn words into vectors
The way we will turn words into vectors in our case is purely subjective. In reality, this task is a much more complex operation involving several approaches such as: frequency of certain words in the corpus, term frequency - inverse document frequency, etc. As I said, in our case we will follow a purely subjective approach: we consider animals as words - of course - and we will turn those words into vectors using two subjective attributes:
- the cuteness (0-100) of the animal, based on my purely subjective feeling
- the size (0-100) of the animal, based on my purely subjective ignorance ;-)
The values themselves are simply based on my own judgment. Your taste in cuteness and evaluation of size may differ significantly from mine. My goal, however, is to explain word-to-vector conversion to you, not to push you to agree with my own subjective judgment ;-)
This is a very simplistic and subjective way to turn words into vectors. Within the limited space of those 14 animals, our task is to find similarities among these words. We start by listing the animals with the two numerical elements in a table, something like this:
Dolphin, cuteness = 60, size = 45
Lobster, cuteness = 2, size = 15
Unfortunately, it is not possible to publish images in LinkedIn articles, but you can see the full table on my blog here. Anyway, as you can see, we have already created vectors out of words, for instance:
Dolphin = (60, 45), which means that v_dolphin = (60, 45)
Lobster = (2, 15), so v_lobster = (2, 15)
Of course, those are really simple vectors. Vectors generated by word2vec or similar algorithms usually have many dimensions and are typically represented as a matrix. In our example, we have just two coordinates and a very limited space in a Cartesian chart... but we can still do a lot of interesting operations, don't worry.
Plot those vectors in a space
Next step: we are going to plot the words as vectors in a space. In our case it will be a limited space because of the few words we use. Our table contains 14 words in total, and the space we are going to create is limited by those 14 words. If we consider a bigger corpus, for instance a collection of 10'000 pieces of customer feedback, we could potentially create a word space of 2'000-3'500 words. Much bigger than the 14 words of our example.
Despite having few words (animals) and a very small space, the values give us everything we need to make determinations about which animals are similar (at least, similar with respect to the properties that we've subjectively included in the data). For instance, let's try to answer the following question: which animal is most similar to a capybara? You could go through the values one by one and do the math to make that evaluation, but visualizing the data as points in 2-dimensional space makes finding the answer very intuitive. Again, have a look at my blog, click here.
The plot shows us that the closest animal to the capybara is the panda bear (again, in terms of its subjective size and cuteness). One way of calculating how "far apart" two points are is to find their Euclidean distance, which is simply the length of the line that connects the two points. The distance between "capybara" (70, 30) and "panda" (74, 40) is about 10.77, which is, for instance, far less than the distance between "tarantula" and "elephant": 104.00691.
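If you want to check that number yourself, here is a minimal Python sketch of the calculation, using the (cuteness, size) coordinates quoted above:

```python
import math

# (cuteness, size) coordinates quoted in the text above.
capybara = (70, 30)
panda = (74, 40)

def euclidean_distance(a, b):
    # Length of the straight line connecting the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance(capybara, panda))  # roughly 10.77
```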
Modeling animals in this way has a few other interesting properties. For example, you can pick an arbitrary point in "animal space" and then find the animal closest to that point. If you imagine an animal of size 25 and cuteness 30, you can easily look at the space to find the animal that most closely fits that description: the chicken.
Reasoning visually, you can also answer questions like what's halfway between a chicken and an elephant? Simply draw a line from "elephant" to "chicken," mark off the midpoint and find the closest animal. (According to our chart, halfway between an elephant and a chicken is a horse.)
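As a rough sketch of how you could do the same thing in code rather than by eye, here is a tiny Python example. The coordinates are illustrative guesses (apart from capybara and panda, which are quoted above), not the exact values from my chart:

```python
import math

# Hypothetical (cuteness, size) vectors for a few of the 14 animals.
animals = {
    "capybara": (70, 30),
    "panda": (74, 40),
    "chicken": (25, 30),
    "horse": (50, 60),
    "elephant": (65, 90),
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest(point):
    # The animal whose vector lies nearest to the given point.
    return min(animals, key=lambda name: distance(animals[name], point))

# Which animal best matches cuteness 30 and size 25?
print(closest((30, 25)))  # chicken, on these toy values

# Halfway between a chicken and an elephant: average the two vectors.
midpoint = tuple((a + b) / 2 for a, b in zip(animals["chicken"], animals["elephant"]))
print(closest(midpoint))  # horse, on these toy values
```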
You can also ask: what's the difference between a hamster and a tarantula? According to our plot, it's about seventy-five units of cute (and a few units of size).
The relationship of "difference" is an interesting one because it allows us to reason about analogous relationships. In the chart below, I've drawn an arrow from "tarantula" to "hamster" (in blue), see here on my blog.
You can understand this arrow as being the relationship between a tarantula and a hamster, in terms of their size and cuteness (i.e., hamsters and tarantulas are about the same size, but hamsters are much cuter). In the same diagram, I've also translated this same arrow (this time in red) so that its origin point is "chicken." The arrow ends closest to "kitten." What we've discovered is that the animal that is about the same size as a chicken but much cuter is... a kitten. To put it in terms of an analogy: tarantulas are to hamsters as chickens are to kittens.
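In vector terms, this analogy is nothing more than a subtraction followed by an addition. A small sketch makes this concrete; the (cuteness, size) values here are again illustrative guesses, not the exact chart values:

```python
# "Tarantula is to hamster as chicken is to ?" expressed as vector arithmetic.
tarantula = (8, 3)
hamster = (80, 6)
chicken = (25, 30)
kitten = (95, 15)

# The arrow from tarantula to hamster is the difference of the two vectors.
diff = (hamster[0] - tarantula[0], hamster[1] - tarantula[1])

# Moving that same arrow so it starts at "chicken" lands near "kitten".
target = (chicken[0] + diff[0], chicken[1] + diff[1])
print(target)  # (97, 33), which on this toy chart is closest to kitten
```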
A sequence of numbers used to identify a point is called a vector, and the kind of math we've been doing so far is called linear algebra. (Linear algebra is surprisingly useful across many domains: It's the same kind of math you might do to, e.g., simulate the velocity and acceleration of a sprite in a video game.)
A set of vectors that are all part of the same data set is often called a vector space. The vector space of animals in this section has two dimensions, by which I mean that each vector in the space has two numbers associated with it (i.e., two columns in the spreadsheet). The fact that this space has two dimensions just happens to make it easy to visualize the space by drawing a 2D plot. But most vector spaces you'll work with will have more than two dimensions—sometimes many hundreds. In those cases, it's more difficult to visualize the "space," but the math works pretty much the same.
How is the power of vectors applied to NLP?
Now let's work with a much more complex vector space: I have 10'000 pieces of customer feedback and I want to discover the main drivers of satisfaction and dissatisfaction mentioned by my clients. The process will follow, more or less, exactly what we did with our vector space of animals, except for an initial phase where we will 'clean' the text.
Clean the text
In order for our vector space to be effective, we will reduce all words in our corpus to tokens. For instance, the sentence 'I like your product' will be reduced to the tokens 'I', 'like', 'your', 'product'. This makes it easier to 'clean' the corpus. We will also remove useless, noisy words such as 'the', 'a', 'an', 'and', etc. We will remove punctuation as well, such as '.', ';', '!', etc.
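A minimal sketch of this cleaning step in plain Python could look like the following (a real pipeline would typically rely on a library such as NLTK or spaCy, with a much longer stop-word list):

```python
import string

# A tiny illustrative stop-word list; real lists contain a few hundred words.
STOPWORDS = {"the", "a", "an", "and", "i", "your"}

def clean(text):
    # Lowercase, strip punctuation, split into tokens, drop the noisy words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOPWORDS]

print(clean("I like your product!"))  # ['like', 'product']
```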
At this point, we will have a long dataset of tokens. One of the problems we will have with those tokens is ambiguity. A token (word) can assume different meanings according to the context: 'I made the program run' and 'I run'. The token run is exactly the same but has two different meanings. To avoid this ambiguity we can use a part-of-speech tagging algorithm: an algorithm that identifies the two occurrences of RUN as different parts of speech. We will then keep them separate for the rest of the analysis.
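As an illustration only, here is how part-of-speech tagging could look with NLTK; I have adapted the second sentence slightly ('I went for a run') so that the two occurrences of 'run' show up as different parts of speech:

```python
import nltk

# One-off downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["I made the program run", "I went for a run"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Each token gets a tag such as VB (verb) or NN (noun), so the different
# usages of 'run' can be kept apart for the rest of the analysis.
```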
At this point, we will try to 'summarize' the tokens. In a few words, we will group them together, reducing the noise. We will use a specific family of algorithms to do that: lemmatization. Lemmatization reduces nouns to their singular form, verbs to their infinitive form, etc. We will then have a long list of tokens (words). It will be a group similar to our 14 animals but, of course, bigger: we will probably end up with something like 3'000 tokens.
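For illustration, NLTK ships a WordNet-based lemmatizer that does exactly this kind of reduction; a minimal sketch:

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-off download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("products"))          # product
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("was", pos="v"))      # be
```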
Now it is time to turn them into vectors. I will not go into details here. As I said before, there are a lot of methodologies to do that. If you are really interested, there are a couple of Python libraries able to do it. For instance, scikit-learn contains a TfidfVectorizer class able to turn tokens (words) into vectors; click here to get more details.
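A minimal sketch of what that looks like in practice (the three feedback sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

feedback = [
    "great product and great support",
    "support was slow",
    "product quality is great",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(feedback)

print(vectorizer.get_feature_names_out())  # the vocabulary: one dimension per token
print(vectors.shape)                       # (3 documents, number of vocabulary terms)
```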
We will end up with a set of vectors and we will plot them in a 3D space. The difference from our animals example is the space: 2D vs 3D. Well, that is the level of sophistication we work with in reality. The vectors now form a complex matrix, no longer just points in a 2D space. But the logic is similar, don't worry.
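For the curious, a common way to get from those high-dimensional vectors to something you can actually plot in 3D is dimensionality reduction. This hedged sketch uses scikit-learn's truncated SVD on a few invented feedback sentences; it is one possible approach, not the only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

feedback = [
    "great product and great support",
    "support was slow",
    "product quality is great",
    "delivery was slow and support unhelpful",
]

# Turn the feedback into TF-IDF vectors, then squeeze them down to 3 dimensions.
vectors = TfidfVectorizer(stop_words="english").fit_transform(feedback)
points_3d = TruncatedSVD(n_components=3).fit_transform(vectors)

print(points_3d.shape)  # (4, 3): one 3D point per feedback entry, ready to plot
```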
Now we can apply exactly the same algebra operations we applied to our animals. We can, for instance, cluster vectors together by selecting a specific point in our 3D space. This allows us to understand what kind of topics clients are talking about. Or we can combine them with sentiment analysis and see the difference between the positive sentiment space and the negative one.
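As one possible sketch of the clustering idea (not our exact pipeline), k-means on the TF-IDF vectors already separates invented feedback about product quality from feedback about support:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

feedback = [
    "great product, love the quality",
    "product quality is amazing",
    "support was slow to answer",
    "waited days for support to reply",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(feedback)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

print(labels)  # e.g. [0 0 1 1]: one cluster about quality, one about support
```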
We can also use the vectors to train a neural network able to categorize the feedback into specific topics, or to assign multiple labels when several topics appear inside the same sentence.
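To keep the sketch simple, here a plain logistic regression stands in for the neural network; the feedback sentences and topic labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

feedback = [
    "the product broke after a week",
    "great product, works perfectly",
    "support never answered my emails",
    "support was friendly and fast",
]
topics = ["product", "product", "support", "support"]

# Vectorize the text and train a simple topic classifier on top of the vectors.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(feedback, topics)

print(model.predict(["the product stopped working"]))  # likely ['product'] on this toy data
```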
Conclusion
My aim in this post was to explain in very simple words how turning a corpus into vectors enables Natural Language Processing. Once words become vectors, they can be analyzed using linear algebra, machine learning, and deep learning algorithms. It is a new and fascinating world, opening a lot of promising scenarios in Natural Language Processing. I hope you are now in a better position to understand the power of vectors in NLP.