What is the difference between keyword search and text mining?

Federico Cesconi

Founder & CEO @sandsiv the number one CXM solution powered by AI | Author | In love with NLP using transformers

Published: September 29, 2017

In my daily work, customers always ask me questions such as: "Is your text mining solution able to read specific words?", "How does it deal with sarcasm?", "What if a word is misspelled or contains a typo?" and many others. Unfortunately, most people who are not familiar with text mining confuse keyword (or word) search with text mining. These are two totally different things.

When you use keyword (word) search, which basically means searching for words in a corpus, you type some words into a search engine and the software brings back one or more documents that contain those words. Each hit corresponds to one document, and typically you need to read the document to decide whether it is relevant or not. So if you have 1,000 hits, you need to read 1,000 documents.
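To make this concrete, here is a minimal sketch of what a keyword search does. The corpus and the function names are invented for this illustration; a real search engine is far more elaborate, but the principle is the same: it returns documents, not meaning.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_search(corpus, query):
    """Return the ids of the documents that contain every word of the query."""
    wanted = set(tokenize(query))
    return [doc_id for doc_id, text in corpus.items()
            if wanted <= set(tokenize(text))]

# Hypothetical customer feedback, made up for this example.
corpus = {
    1: "The delivery was late and nobody answered my call.",
    2: "Great product, fast delivery!",
    3: "I called support twice about the late invoice.",
}

print(keyword_search(corpus, "late delivery"))  # [1]
print(keyword_search(corpus, "call"))           # [1]  ("called" in document 3 is missed)
```

The search only tells you which documents to open. It does not say what the customer complained about, and an exact-match query for "call" already misses "called".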

At the other end, text mining software is able to "read" and "interpret" the meaning of the data inside the document. It identifies concepts and relationships and presents the results back to you in a structured form. The results are fragments of text that correspond to facts, associations or relationships. You only need to read the document once you have found a relevant hit.

I personally think this confusion has been generated by vendors, especially in the CXM space, setting the wrong expectations. Common wrong expectations are:

With a click of the mouse you get all topics out of a big data set (corpus) of customer feedback.
It is magic: you don't need to do anything, the software itself will understand all topics, sentiment and correlations inside the text (corpus). Don't believe this bullshit, even if the guy telling it to you is called Watson ;-)

Unfortunately this is not the case, even if we are not far from a solution, especially for the first point. The hilarious fact is that, as I said, many CXM vendors are selling keyword search as text mining ...and I can tell you, they put a very high price tag on that gimmick!

Why is keyword search not the right way to identify topics in a customer feedback corpus? As I said before, you typically need to read a document to decide whether it is relevant or not. It turns into a nightmare: the quality of your classification will be horrible, and maintaining the rules that identify topics by keywords will be a never-ending sad story.
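A small sketch shows how quickly keyword rules break down. The rule set and the example sentences below are hypothetical, written only to illustrate the two typical failures: false positives and missed topics.

```python
# Hypothetical hand-written rules: topic -> trigger keywords.
RULES = {
    "delivery": {"delivery", "shipping", "late"},
    "support": {"support", "call", "agent"},
}

def classify(text):
    """Assign every topic whose keyword list shares a word with the text."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    return sorted(topic for topic, keys in RULES.items() if words & keys)

print(classify("Shipping was late."))                # ['delivery']
print(classify("The agent was late for our call."))  # ['delivery', 'support'], "delivery" is a false positive
print(classify("My parcel never arrived."))          # [], a delivery complaint that is missed
```

Every fix means another rule or another exception, which is why the maintenance never ends.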

With "real text mining", especially deep machine learning, you will be able to get close to a "push the button" unsupervised process for identifying topics. At a very high level, the process is as "easy" as 3 steps:

"Clean" your corpus using specific linguistic algorithms such as tokenisation, part-of-speech tagging, stop-word removal, disambiguation, lemmatisation, etc.
Turn words into numbers (vectors) using different techniques, for instance continuous "bag of words", "sparse matrix", "tf-idf matrix", etc. The reason to turn words into numbers is that deep machine learning (neural networks) loves numbers.
Apply specific machine learning techniques such as word2vec, Latent Dirichlet Allocation (LDA), K-means clustering, etc. to automatically discover topics and sentiment. All those techniques are "code-able" with libraries such as TensorFlow, Gensim, GloVe, spaCy, etc.
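The three steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production pipeline: the stop-word list is tiny, there is no lemmatisation, the vectors are plain tf-idf, the clustering is a simple k-means with cosine similarity and fixed seed documents (so the result is reproducible), and the feedback sentences are invented. In practice you would use one of the libraries mentioned above.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "was", "and", "my", "to", "very", "i", "it"}  # tiny illustrative list

def clean(text):
    """Step 1: tokenise, lowercase and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Step 2: turn each token list into a sparse tf-idf vector (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def kmeans(vectors, seeds, iterations=10):
    """Step 3: group vectors around centroids, starting from the seed documents."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                total = {}
                for m in members:
                    for t, w in m.items():
                        total[t] = total.get(t, 0.0) + w
                centroids[k] = {t: w / len(members) for t, w in total.items()}
    return labels

# Hypothetical customer feedback, made up for this example.
feedback = [
    "The delivery was late",
    "Late delivery and damaged parcel",
    "Support agent was very rude",
    "Rude agent, I called support",
]

vectors = tfidf([clean(text) for text in feedback])
labels = kmeans(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Nobody told the code that the topics are "delivery" and "support"; the two groups come out of the vectors alone.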

This approach allows you to answer some of the specific requests mentioned before:

The computer will not 'read' words; it will turn them into vectors and understand them in a mathematical way (e.g. word2vec, lda2vec, etc.).
The approach is language agnostic: it will analyse English the same way as Swiss-German, Arabic or Japanese ...and I can assure you, this is a big advantage.
The solution will solve the problem of misspelled words: the vectors of "my father is coming home" and "my father is coming ome" are so close that there will be no difference as input for a neural network.
New words and concepts popping up will not be a problem: new vectors will appear in the vector space, you will be able to identify them and follow them in a trend.
What we call "multitopics" in the same sentence, such as "Your product is great but your customer care sucks!", can be easily detected by LDA models, which also report the right sentiment.
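The point about misspelled words can be illustrated with a rough sketch. As an assumption for this example, the sentences are represented by counts of character trigrams (a subword idea, used here in place of a trained word2vec model) and compared with cosine similarity. The third sentence is invented as a contrast.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Represent a text as counts of its character n-grams (padded with spaces)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(u, v):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(c * v[g] for g, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

original = char_ngrams("my father is coming home")
typo = char_ngrams("my father is coming ome")
other = char_ngrams("the invoice was wrong")  # unrelated sentence, for contrast

print(round(cosine(original, typo), 2))                   # 0.89
print(cosine(original, typo) > cosine(original, other))   # True
```

The typo moves the vector only slightly, while an unrelated sentence lands far away. A keyword search for "home" would simply miss the misspelled sentence.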

Easy as 1, 2 and 3 ...well easy if you are working with the right partner ;-)

Stat Reviewer

Statistical and Medical Informatics Reviewer

6 years ago

Good article. The following article will also be helpful: Tutorial on Mining of Biomedical Literature with the Help of R Package, www.vinaitheerthan.com/research.php

Andrew Tucker, Ph.D.

Co-Founder, CEO at Mettle Capital

7 years ago

I agree that the distinction between keyword search and text mining isn't clear in clients' minds. As a result, we haven't yet realised the full potential of the latter. As a specialist group, we need to be showing what is possible with text mining. For example, my team look at what drives trust in brands to predict market share growth. Perhaps your next post could cover standout examples of where businesses use text mining? Happy to contribute if that helps?

Walter Mitchell

Maritime Domain Expert | Analysis and Advisory

7 years ago

SparkCognition's DeepNLP would be very useful in this app

Patrizia Alfiero

Head of Pricing Analytics - Poste Italiane

7 years ago

Thank you, Federico! A concentrate of technique and wisdom.

Brigitte Kobi

Stilettissimo - Numbered Luxury - The Next Level of Footwear - Swiss Design - Italian Artistry - Uncompromising Excellence - The Right Shoe Is Glamour

7 years ago

Very clearly explained. Thank you, Federico Cesconi
